# Initialize Otter
import otter
grader = otter.Notebook("hw2-seda.ipynb")Background
import numpy as np
import pandas as pd
import altair as alt
# disable row limit for plotting
alt.data_transformers.disable_max_rows()
# uncomment to ensure graphics display with pdf export
# alt.renderers.enable('mimetype')DataTransformerRegistry.enable('default')
Gender achievement gaps in education have been well-documented over the years – studies consistently find boys outperforming girls on math tests and girls outperforming boys on reading and language tests. A particularly controversial article was published in Science in 1980 arguing that this pattern was due to an ‘innate’ difference in ability (focusing on mathematics rather than on reading and language). Such views persisted in part because studying systematic patterns in achievement nationwide was a challenge due to differential testing standards across school districts and the general lack of availability of large-scale data.
It is only recently that data-driven research has begun to reveal socioeconomic drivers of achievement gaps. The Standford Educational Data Archive (SEDA), a publicly available database on academic achievement and educational opportunity in U.S. schools, has supported this effort. The database is part of a broader initiave aiming to improve educational opportunity by enabling researchers and policymakers to identify systemic drivers of disparity.
SEDA includes a range of detailed data on educational conditions, contexts, and outcomes in school districts and counties across the United States. It includes measures of academic achievement and achievement gaps for school districts and counties, as well as district-level measures of racial and socioeconomic composition, racial and socioeconomic segregation patterns, and other features of the schooling system.
The database standardizes average test scores for schools 10,000 U.S. school districts relative to national standards to allow comparability between school districts and across grade levels and years. The test score data come from the U.S. Department of Education. In addition, multiple data sources (American Community Survey and Common Core of Data) are integrated to provide district-level socioeconomic and demographic information.
A study of the SEDA data published in 2018 identified the following persistent patterns across grade levels 3 - 8 and school ears from 2008 through 2015: * a consistent reading and language achievement gap favoring girls; * no national math achievement gap on average; and * local math achievement gaps that depend on the socioeconomic conditions of school districts. You can read about the main findings of the study in this brief NY Times article.
Below, we’ll work with selected portions of the database. The full datasets can be downloaded here.
Assignment objectives
In this assignment, you’ll explore achievement gaps in California school districts in 2018, reproducing the findings described in the article above on a more local scale and with the most recent SEDA data. You’ll practice the following:
- review of data documentation
- assessment of sampling design and scope of inference
- data tidying operations
- slicing and filtering
- merging multiple data frames
- pivoting tables
- renaming and reordering variables
- constructing exploratory graphics and visualizing trends
- data aggregations
- narrative summary of exploratory analysis
Import and assessment of datasets
You’ll work with test data and socioeconomic covariates aggregated to the school district level. These data are stored in two separate tables. Here you’ll examine them and review data documentation.
Test score data
The first few rows of the test data are shown below. The columns are:
| Column name | Meaning |
|---|---|
sedalea |
District ID |
grade |
Grade level |
stateabb |
State abbreviation |
sedaleaname |
District name |
subject |
Test subject |
cs_mn_... |
Estimated mean test score |
cs_mnse_... |
Standard error for estimated mean test score |
totgyb_... |
Number of individual tests used to estimate the mean score |
# import seda data
ca_main = pd.read_csv('data/ca-main.csv')
ca_cov = pd.read_csv('data/ca-cov.csv')
# preview test score data
ca_main.head(3)| sedalea | grade | stateabb | sedaleaname | subject | cs_mn_all | cs_mnse_all | totgyb_all | cs_mn_asn | cs_mnse_asn | ... | totgyb_whg | cs_mn_wht | cs_mnse_wht | totgyb_wht | cs_mn_wmg | cs_mnse_wmg | totgyb_wmg | cs_mn_wng | cs_mnse_wng | totgyb_wng | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 600001 | 4 | CA | ACTON-AGUA DULCE UNIFIED ... | mth | -0.367007 | 0.108543 | 86.0 | NaN | NaN | ... | 79.0 | -0.208654 | 0.165783 | 35.0 | -0.089003 | 0.518066 | 38.0 | NaN | NaN | NaN |
| 1 | 600001 | 4 | CA | ACTON-AGUA DULCE UNIFIED ... | rla | 0.005685 | 0.117471 | 85.0 | NaN | NaN | ... | 78.0 | 0.259587 | 0.189614 | 35.0 | 0.526942 | 0.602989 | 38.0 | NaN | NaN | NaN |
| 2 | 600001 | 6 | CA | ACTON-AGUA DULCE UNIFIED ... | rla | -0.000040 | 0.092172 | 114.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 rows × 59 columns
The test score means for each district are named cs_mn_... with an abbreviation indicating subgroup (such as mean score for all cs_mean_all, for boys cs_mean_mal, for white students cs_mn_wht, and so on). Notice that these are generally small-ish: decimal numbers between -0.5 and 0.5.
These means are estimated from a number of individual student tests and standardized relative to national averages. They represent the number of standard deviations by which a district mean differs from the national average. So, for instance, the value cs_mn_all = 0.1 indicates that the district average is estimated to be 0.1 standard deviations greater than the national average on the corresponding test and at the corresponding grade level.
Question 1: Interpreting test score values
Interpret the average math test score for all 4th grade students in Acton-Agua Dulce Unified School District (the first row of the dataset shown above).
Type your answer here, replacing this text.
SOLUTION In Acton-Agua Dulce Unified School District, the mean math test score for fourth graders in 2018 is estimated to be 0.37 standard deviations below the national average.
Covariate data
The first few rows of the covariate data are shown below. The column information is as follows:
| Column name | Meaning |
|---|---|
sedalea |
District ID |
grade |
Grade level |
sedaleanm |
District name |
urban |
Indicator: is the district in an urban locale? |
suburb |
Indicator: is the district in a suburban locale? |
town |
Indicator: is the district in a town locale? |
rural |
Indicator: is the district in a rural locale? |
locale |
Description of district locale |
| Remaining variables | Demographic and socioeconomic measures |
ca_cov.head(3)| sedalea | grade | sedaleanm | urban | suburb | town | rural | locale | perind | perasn | ... | snapall | snapblk | snaphsp | snapwht | single_momall | single_momblk | single_momhsp | single_momwht | seswhtblk | seswhthsp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 600001 | 4.0 | ACTON-AGUA DULCE UNIFIED ... | 0.0 | 0.0 | 0.0 | 1.0 | Rural, Distant | 0.003893 | 0.045901 | ... | 0.035165 | 0.20293 | 0.0819 | 0.032362 | 0.084385 | 0.349636 | 0.198482 | 0.061653 | 1.839339 | 0.692566 |
| 1 | 600001 | 5.0 | ACTON-AGUA DULCE UNIFIED ... | 0.0 | 0.0 | 0.0 | 1.0 | Rural, Distant | 0.003788 | 0.046652 | ... | 0.035165 | 0.20293 | 0.0819 | 0.032362 | 0.084385 | 0.349636 | 0.198482 | 0.061653 | 1.839339 | 0.692566 |
| 2 | 600001 | 6.0 | ACTON-AGUA DULCE UNIFIED ... | 0.0 | 0.0 | 0.0 | 1.0 | Rural, Distant | 0.003218 | 0.043657 | ... | 0.035165 | 0.20293 | 0.0819 | 0.032362 | 0.084385 | 0.349636 | 0.198482 | 0.061653 | 1.839339 | 0.692566 |
3 rows × 60 columns
You will only be working with a handful of the demographic and socioeconomic measures, so you can put off getting acquainted with those until selecting a subset of variables.
Question 2: Data semantics
In the non-public data, observational units are students – test scores are measured for each student. However, in the SEDA data you’ve imported, scores are aggregated to the district level by grade. Let’s regard estimated test score means for each grade as distinct variables, so that an observation consists in a set of estimated means for different grade levels and groups. In this view, what are the observational units in the test score dataset? Are they the same or different for the covariate dataset?
Type your answer here, replacing this text.
SOLUTION The observational units are school districts. They are the same for each dataset.
Question 3: Sample sizes
How many observational units are in each dataset? Count the number of units in the test dataset and the number of units in the covariate dataset separately. Store the values as ca_cov_units and ca_main_units, respectively.
(Hint: use .nunique().)
ca_cov_units = ca_cov.sedalea.nunique() # SOLUTION
ca_main_units =ca_main.sedalea.nunique() # SOLUTION
print('units in covariate data: ', ca_cov_units)
print('units in test score data: ', ca_main_units)units in covariate data: 913
units in test score data: 872
grader.check("q3")Question 4: Sample characteristics and scope of inference
Answer the questions below about the sampling design in a short paragraph. You do not need to dig through any data documentation in order to resolve these questions.
- What is the relevant population for the datasets you’ve imported?
- About what proportion (to within 0.1) of the population is captured in the sample? (Hint: have a look at this website.)
- Considering that the sampling frame is not identified clearly, what kind of dataset do you suspect this is (e.g., administrative, data from a ‘typical sample’, census, etc.)?
- Considering that the sampling frame is not identified clearly, what kind of dataset do you suspect this is (e.g., administrative, data from a ‘typical sample’, census, etc.)?
- In light of your description of the sample characteristics, what is the scope of inference for this dataset?
Type your answer here, replacing this text.
SOLUTION
The population of interest is all school districts in California in 2018. There are ~1000 school districts in total in 2022-2023, and although districts change a little from year to year, the total number doesn’t change too much in adjacent years. So this sample from 2018 likely covers about 90% of the population. The dataset is best described as administrative data: it is not a census, because the entire population is not included; and it is not a ‘typical’ sample or a random sample because there is no random selection mechanism from a well-defined frame. This data does not support inference and should be used for summary purposes only – conclusions should not be generalized beyond the sample. However, this actually isn’t much of a limitation here because the sample covers so much of the population.
Data tidying
Since you’ve already had some guided practice doing this in previous assignments, you’ll be left to fill in a little bit more of the details on your own in this assignment. You’ll work with the following variables from each dataset:
- Test score data
- District ID
- District name
- Grade
- Test subject
- Estimated male-female gap
- Covariate data
- District ID
- Locale
- Grade
- Socioeconomic status (all demographic groups)
- Log median income (all demographic groups)
- Poverty rate (all demographic groups)
- Unemployment rate (all demographic groups)
- SNAP benefit receipt rate (all demographic groups)
Question 5: Variable names of interest
Download the codebooks by opening the ‘data’ directory from your Jupyter Lab file navigator and downloading the codebook files. Identify the variables listed above, and store the column names in lists named main_vars and cov_vars.
# store variable names of interest
main_vars = ['sedalea', 'grade', 'sedaleaname', 'subject', 'cs_mn_mfg'] # SOLUTION
cov_vars = ['sedalea', 'locale', 'grade', 'sesall', 'lninc50all', 'povertyall', 'unempall', 'snapall'] # SOLUTIONgrader.check("q5")Question 6: Slice columns
Use your result from above to slice the columns of interest from the covariate and test score data. Store the resulting data frames as main_sub and cov_sub (for ‘subset’).
# slice columns to select variables of interest
main_sub = ca_main.loc[:, main_vars] # SOLUTION
cov_sub = ca_cov.loc[:, cov_vars] # SOLUTIONgrader.check("q6")In the next step you’ll merge the covariate data with the test score data. In order to do this, you can use the pd.merge(A, B, how = ..., on = SHARED_COLS) function, which will match the rows of A and B based on the shared columns SHARED_COLS. If how = 'left', then only rows in A will be retained in the output (so B will be merged to A); conversely, if how = 'right', then only rows in B will be retained in the output (so A will be merged to B).
A simple example of the use of pd.merge is illustrated below:
# toy data frames
A = pd.DataFrame(
{'shared_col': ['a', 'b', 'c'],
'x1': [1, 2, 3],
'x2': [4, 5, 6]}
)
B = pd.DataFrame(
{'shared_col': ['a', 'b'],
'y1': [7, 8]}
)A| shared_col | x1 | x2 | |
|---|---|---|---|
| 0 | a | 1 | 4 |
| 1 | b | 2 | 5 |
| 2 | c | 3 | 6 |
B| shared_col | y1 | |
|---|---|---|
| 0 | a | 7 |
| 1 | b | 8 |
Below, if A and B are merged retaining the rows in A, notice that a missing value is input because B has no row where the shared column (on which the merging is done) has value c. In other words, the third row of A has no match in B.
# left join
pd.merge(A, B, how = 'left', on = 'shared_col')| shared_col | x1 | x2 | y1 | |
|---|---|---|---|---|
| 0 | a | 1 | 4 | 7.0 |
| 1 | b | 2 | 5 | 8.0 |
| 2 | c | 3 | 6 | NaN |
If the direction of merging is reversed, and the row structure of B is dominant, then the third row of A is dropped altogether because it has no match in B.
# right join
pd.merge(A, B, how = 'right', on = 'shared_col')| shared_col | x1 | x2 | y1 | |
|---|---|---|---|---|
| 0 | a | 1 | 4 | 7 |
| 1 | b | 2 | 5 | 8 |
Question 7: Merge
Merge the covariate and test score data on both the district ID and grade level columns, and retain only the columns from the test score data (meaning, merge the covariate data to the test score data). Store the resulting data frame as rawdata and print the first four rows.
# merge covariates with gap data
rawdata = pd.merge(cov_sub, main_sub, how = 'right', on = ['sedalea', 'grade']) # SOLUTION
# print first four rows
rawdata.head(4) # SOLUTION| sedalea | locale | grade | sesall | lninc50all | povertyall | unempall | snapall | sedaleaname | subject | cs_mn_mfg | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 600001 | Rural, Distant | 4.0 | 1.237209 | 11.392048 | 0.091894 | 0.048886 | 0.035165 | ACTON-AGUA DULCE UNIFIED ... | mth | NaN |
| 1 | 600001 | Rural, Distant | 4.0 | 1.237209 | 11.392048 | 0.091894 | 0.048886 | 0.035165 | ACTON-AGUA DULCE UNIFIED ... | rla | NaN |
| 2 | 600001 | Rural, Distant | 6.0 | 1.237209 | 11.392048 | 0.091894 | 0.048886 | 0.035165 | ACTON-AGUA DULCE UNIFIED ... | rla | NaN |
| 3 | 600001 | Rural, Distant | 8.0 | 1.237209 | 11.392048 | 0.091894 | 0.048886 | 0.035165 | ACTON-AGUA DULCE UNIFIED ... | mth | -0.562855 |
grader.check("q7")Question 8: Rename and reorder columns
Now rename and rearrange the columns of rawdata so that they appear in the following order and with the following names:
- District ID, District, Locale, log(Median income), Poverty rate, Unemployment rate, SNAP rate, Socioeconomic index, Grade, Subject, Gender gap
Store the resulting data frame as rawdata_mod1 and print the first four rows.
(Hint: first define a dictionary to map the old names to the new ones; then create a list of the new names specified in the desired order; then use .rename() and .loc[]. You can follow the renaming steps in HW1 as an example if needed.)
# define dictionary mapping for renaming columns
# BEGIN SOLUTION
name_dict = {
'sedalea': 'District ID',
'sedaleaname': 'District',
'sesall': 'Socioeconomic index',
'lninc50all': 'log(Median income)',
'povertyall': 'Poverty rate',
'unempall': 'Unemployment rate',
'snapall': 'SNAP rate',
'cs_mn_mfg': 'Gender gap',
'subject': 'Subject',
'locale': 'Locale',
'grade': 'Grade'
}
# END SOLUTION
# specify order of columns
# BEGIN SOLUTION
name_order = ['District ID', 'District', 'Locale',
'log(Median income)', 'Poverty rate', 'Unemployment rate', 'SNAP rate',
'Socioeconomic index', 'Grade', 'Subject', 'Gender gap']
# END SOLUTION
# rename and reorder
# BEGIN SOLUTION
rawdata_mod1 = rawdata.rename(
columns = name_dict
).loc[:, name_order]
# END SOLUTION
# print first four rows
rawdata_mod1.head(4) # SOLUTION| District ID | District | Locale | log(Median income) | Poverty rate | Unemployment rate | SNAP rate | Socioeconomic index | Grade | Subject | Gender gap | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 600001 | ACTON-AGUA DULCE UNIFIED ... | Rural, Distant | 11.392048 | 0.091894 | 0.048886 | 0.035165 | 1.237209 | 4.0 | mth | NaN |
| 1 | 600001 | ACTON-AGUA DULCE UNIFIED ... | Rural, Distant | 11.392048 | 0.091894 | 0.048886 | 0.035165 | 1.237209 | 4.0 | rla | NaN |
| 2 | 600001 | ACTON-AGUA DULCE UNIFIED ... | Rural, Distant | 11.392048 | 0.091894 | 0.048886 | 0.035165 | 1.237209 | 6.0 | rla | NaN |
| 3 | 600001 | ACTON-AGUA DULCE UNIFIED ... | Rural, Distant | 11.392048 | 0.091894 | 0.048886 | 0.035165 | 1.237209 | 8.0 | mth | -0.562855 |
grader.check("q8")Question 9: Pivot
Notice that the Gender gap column contains the values of two variables: the gap in estimated mean test scores for math tests, and the gap in estimated mean test scores for reading and language tests. To put the data in tidy format, use .pivot and .rename() to pivot the table so that the gender gap column is spread into two columns named Math gap and Reading gap. Store the result as seda_data and print the first four rows.
Hint: to avoid unweildy column indexing, make sure you specify a values = ... argument when using .pivot(). Doing so will result in the column index being named Subject; remove this name in your solution.
Aside: an alternative solution is to manipulate the indices and use .unstack(), but this method will produce a dataframe with hierarchical column indexing (you’ll see) in which Subject is retained as a lower-level index; this will need to be collapsed in order to rename the columns as instructed using MultiIndex.droplevel() or similar.
# pivot to unstack gender gap (fixing tidy issue: multiple variables in one column)
# BEGIN SOLUTION
seda_data = rawdata_mod1.pivot(
index = name_order[0:9],
columns = 'Subject',
values = 'Gender gap'
).rename(
columns = {'mth': 'Math gap', 'rla': 'Reading gap'}
).reset_index()
seda_data.columns.name = None
# alternative solution using index manipulations
# seda_data = rawdata_mod1.set_index(name_order[0:10]).unstack(-1).reset_index() # option 1
# idx = seda_data.columns.droplevel(1)
# idx.values[9:11] = ["Math gap", "Reading gap"]
# seda_data.columns = idx
# END SOLUTION
# print first four rows
seda_data.head(4) # SOLUTION| District ID | District | Locale | log(Median income) | Poverty rate | Unemployment rate | SNAP rate | Socioeconomic index | Grade | Math gap | Reading gap | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 600001 | ACTON-AGUA DULCE UNIFIED ... | Rural, Distant | 11.392048 | 0.091894 | 0.048886 | 0.035165 | 1.237209 | 4.0 | NaN | NaN |
| 1 | 600001 | ACTON-AGUA DULCE UNIFIED ... | Rural, Distant | 11.392048 | 0.091894 | 0.048886 | 0.035165 | 1.237209 | 6.0 | NaN | NaN |
| 2 | 600001 | ACTON-AGUA DULCE UNIFIED ... | Rural, Distant | 11.392048 | 0.091894 | 0.048886 | 0.035165 | 1.237209 | 8.0 | -0.562855 | -0.785321 |
| 3 | 600006 | ROSS VALLEY ELEMENTARY ... | Suburb, Large | 11.607236 | 0.041418 | 0.048269 | 0.028006 | 1.912972 | 4.0 | -0.025131 | -0.521408 |
grader.check("q9")Your final dataset should match the dataframe below. You can use this to check your answer and revise any portions above that lead to different results.
# intended result
data_reference = pd.read_csv('data/tidy-seda-check.csv')
data_reference| District ID | District | Locale | log(Median income) | Poverty rate | Unemployment rate | SNAP rate | Socioeconomic index | Grade | Math gap | Reading gap | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 600001 | ACTON-AGUA DULCE UNIFIED ... | Rural, Distant | 11.392048 | 0.091894 | 0.048886 | 0.035165 | 1.237209 | 4.0 | NaN | NaN |
| 1 | 600001 | ACTON-AGUA DULCE UNIFIED ... | Rural, Distant | 11.392048 | 0.091894 | 0.048886 | 0.035165 | 1.237209 | 6.0 | NaN | NaN |
| 2 | 600001 | ACTON-AGUA DULCE UNIFIED ... | Rural, Distant | 11.392048 | 0.091894 | 0.048886 | 0.035165 | 1.237209 | 8.0 | -0.562855 | -0.785321 |
| 3 | 600006 | ROSS VALLEY ELEMENTARY ... | Suburb, Large | 11.607236 | 0.041418 | 0.048269 | 0.028006 | 1.912972 | 4.0 | -0.025131 | -0.521408 |
| 4 | 600006 | ROSS VALLEY ELEMENTARY ... | Suburb, Large | 11.607236 | 0.041418 | 0.048269 | 0.028006 | 1.912972 | 5.0 | 0.143163 | -0.210279 |
| 5 | 600006 | ROSS VALLEY ELEMENTARY ... | Suburb, Large | 11.607236 | 0.041418 | 0.048269 | 0.028006 | 1.912972 | 6.0 | 0.065459 | -0.034993 |
| 6 | 600006 | ROSS VALLEY ELEMENTARY ... | Suburb, Large | 11.607236 | 0.041418 | 0.048269 | 0.028006 | 1.912972 | 7.0 | NaN | -0.203610 |
| 7 | 600011 | FORT SAGE UNIFIED ... | Rural, Distant | 10.704570 | 0.159981 | 0.066333 | 0.102054 | -0.478127 | 4.0 | -0.015417 | -0.191400 |
| 8 | 600011 | FORT SAGE UNIFIED ... | Rural, Distant | 10.704570 | 0.159981 | 0.066333 | 0.102054 | -0.478127 | 5.0 | NaN | NaN |
| 9 | 600011 | FORT SAGE UNIFIED ... | Rural, Distant | 10.704570 | 0.159981 | 0.066333 | 0.102054 | -0.478127 | 7.0 | NaN | NaN |
Question 10: Sanity check
Ensure that your tidying did not inadvertently drop any observations: count the number of units in seda_data. Does this match the number of units represented in the original test score data ca_main? Store these values as data_units and ca_main_units, respectively.
(Hint: use .nunique().)
# number of districts in tidied data compared with raw
data_units = seda_data['District ID'].nunique() # SOLUTION
ca_main_units = ca_main.sedalea.nunique() # SOLUTIONgrader.check("q10")Question 11: Missing values
Gap estimates were not calculated for certain grades in certain districts due to small sample sizes (not enough individual tests recorded). Answer the following:
- What proportion of rows are missing for each of the reading and math gap variables? Store these values as
math_missingandreading_missing, respectively.
- What proportion of rows are missing for each of the reading and math gap variables? Store these values as
- What proportion of districts (not rows!) have missing gap estimates for one or both test subjects for at least one grade level?Store the value as
district_missing.
- What proportion of districts (not rows!) have missing gap estimates for one or both test subjects for at least one grade level?Store the value as
# proportion of missing values
math_missing, reading_missing = seda_data.loc[:, seda_data.columns.str.endswith('gap')].isna().mean() # SOLUTION
# proportion of districts with missing values
# BEGIN SOLUTION
district_missing = np.mean(
seda_data.set_index(
'District ID'
).loc[:, ['Math gap', 'Reading gap']].mean(
axis = 1,
skipna = False
).isna().reset_index().groupby(
'District ID'
).sum() > 0, axis = 0).values[0]
# END SOLUTIONgrader.check("q11")Question 12: Missing mechanism
Do you expect that this missingness is more likely for some districts than for others? If so, explain; why is this, and is bias a concern if missing values are dropped?
Type your answer here, replacing this text.
SOLUTION Since missing estimates arise from small sample sizes, missing values are more likely for districts with lower enrollment. Dropping these values would bias results towards larger districts, however, since the sample is nonrandom anyway, this is not a formal concern over and above existing inferential limitations.
Exploratory graphics
For the purpose of visualizing the relationship between estimated gender gaps and socioeconomic variables, you’ll find it more helpful to store a non-tidy version of the data. The cell below rearranges the dataset so that one column contains an estimated gap, one column contains the value of a socioeconomic variable, and the remaining columns record the gap type and variable identity.
Ensure that your results above match the reference dataset before running this cell.
# format data for plotting
plot_df = seda_data.melt(
id_vars = name_order[0:9],
value_vars = ['Math gap', 'Reading gap'],
var_name = 'Gap type',
value_name = 'Gap'
).melt(
id_vars = ['District ID', 'District', 'Locale', 'Gap type', 'Gap', 'Grade'],
value_vars = name_order[3:8],
var_name = 'Socioeconomic variable',
value_name = 'Measure'
)
# preview
plot_df.head()| District ID | District | Locale | Gap type | Gap | Grade | Socioeconomic variable | Measure | |
|---|---|---|---|---|---|---|---|---|
| 0 | 600001 | ACTON-AGUA DULCE UNIFIED ... | Rural, Distant | Math gap | NaN | 4.0 | log(Median income) | 11.392048 |
| 1 | 600001 | ACTON-AGUA DULCE UNIFIED ... | Rural, Distant | Math gap | NaN | 6.0 | log(Median income) | 11.392048 |
| 2 | 600001 | ACTON-AGUA DULCE UNIFIED ... | Rural, Distant | Math gap | -0.562855 | 8.0 | log(Median income) | 11.392048 |
| 3 | 600006 | ROSS VALLEY ELEMENTARY ... | Suburb, Large | Math gap | -0.025131 | 4.0 | log(Median income) | 11.607236 |
| 4 | 600006 | ROSS VALLEY ELEMENTARY ... | Suburb, Large | Math gap | 0.143163 | 5.0 | log(Median income) | 11.607236 |
Gender gaps and socioeconomic factors
The cell below generates a panel of scatterplots showing the relationship between estimated gender gap and socioeconomic factors for all grade levels by test subject. The plot suggests that the reading gap favors girls consistently across the socioeconomic spectrum – in a typical district girls seem to outperform boys by 0.25 standard deviations of the national average. By contrast, the math gap appears to depend on socioeconomic factors – boys only seem to outperform girls under better socioeconomic conditions.
# plot gap against socioeconomic variables by subject for all grades
fig1 = alt.Chart(plot_df).mark_circle(opacity = 0.1).encode(
y = 'Gap',
x = alt.X('Measure', scale = alt.Scale(zero = False), title = ''),
color = 'Gap type'
).properties(
width = 100,
height = 100
).facet(
column = alt.Column('Socioeconomic variable')
).resolve_scale(x = 'independent')
fig1